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Importance  sampling  is  one  of  the  classical  variance  reduction  techniques  for  increasing 
the  efficiency  of  Monte  Carlo  algorithms  for  estimating  integrals.  The  basic  idea  is  to 
replace  the  original  random  mechanism  in  the  simulation  by  a  new  one  and  at  the  same 
time  modify  the  function  being  integrated.  In  this  paper  the  idea  is  extended  to  problems 
arising  in  the  simulation  of  stochastic  systems.  Discrete-time  Markov  chains,  continuous¬ 
time  Markov  chains,  and  generalized  semi-Markov  processes  are  covered.  Applications  are 
given  to  a  GI/G/l  queueing  problem  and  response  surface  estimation.  Computation  of 
the  theoretical  moments  arising  in  importance  sampling  is  discussed  and  some  numerical 
examples  given. 
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1.  INTRODUCTION 

The  use  of  importance  sampling  has  long  been  recognized  as  a  useful  technique  for 
increasing  the  efficiency  of  Monte  Carlo  algorithms  for  numerically  evaluating  integrals. 
Given  the  goal  of  calculating  I  =  f(x)dx,  importance  sampling  involves  choosing  a 
probability  density  g()  and  observing  that  I  can  be  represented  as 

I  =  E{f(X)/g(X)}  (1) 

where  X  is  a  r.v.  with  density  g.  One  then  samples  repeatedly  from  the  density  g,  and 
estimates  I  by  the  sample  mean  of  the  observations  f(X)/g(X).  The  term  importance 
sampling  derives  from  the  fact  that  one  can  choose  g  to  be  large  in  the  regions  that  are  most 
important,  namely  where  |/|  is  largest.  Substantial  efficiency  increases  over  conventional 
Monte  Carlo  are  possible  when  this  technique  is  properly  implemented. 

Our  purpose  in  this  paper  is  to  illustrate  how  the  method  generalizes  to  stochastic 
simulations,  to  develop  some  of  the  basic  theory  relevant  to  this  applications  setting,  and 
to  briefly  describe  some  of  the  promising  research  currently  underway  in  the  area.  The 
rest  of  the  paper  is  organized  as  follows.  As  should  be  clear  from  the  above  discussion, 
the  critical  element  of  importance  sampling  is  the  choice  of  the  density  g.  We  therefore 
discuss  methods  for  constructing  suitable  “densities”  in  the  context  of  discrete-time  Markov 
chains  (Section  2),  continuous-time  Markov  chains  (Section  3),  auid  general  discrete-event 
stochastic  systems  (Section  4).  Section  5  gives  am  abstract  overview  of  the  construction, 
and  relates  the  importance  sampling  density  g  to  the  likelihood  ratio  of  statistics  and  the 
Radon-Nikodym  derivative  of  measure-theoretic  probability.  Section  6  applies  the  ideas  of 
Section  5  to  the  processes  discussed  in  Sections  2-4.  In  Section  7  we  discuss  the  problem 
of  selecting  the  optimal  density  g  over  a  suitably  defined  parametric  family  of  importance 
sampling  densities.  An  application  of  this  idea  to  the  GI/G/l  queue  is  given.  Section  8 
deals  with  the  problem  of  response  surface  estimation.  Here  it  is  shown  that  sometimes  the 
simulator  can  estimate  an  expectation  depending  on  a  parameter  by  only  simulating  at  one 
value  of  the  parameter.  Finally,  in  Section  9  we  discuss  the  computation  of  a  number  of  the 
moments  arising  in  importance  sampling  and  give  a  numerical  example  of  the  technique. 

2.  IMPORTANCE  SAMPLING  FOR  DISCRETE-TIME  MARKOV  CHAINS 

Suppose  that  we  are  assigned  the  following  discrete-time  Markov  chain  simulation 

problem: 


1 


problem: 


Given  a  transition  matrix  P,  initial  distribution  p,  and  read-valued  function  /, 
calculate 

a-E,nX. o.Xi,...) 

where  X  =  {X*  :  n  >  0}  is  the  Markov  chain  associated  with  P  and  m-  (£p( )  denotes 
the  expectation  on  the  path  space  of  X  associated  with  P  and  p>) 


We  will  now  show  how  to  construct  an  analogue  to  the  importance  sampling  density  g  in 
this  problem  context.  Suppose  we  represent  n  and  P  respectively  as  p  =  (m(»)  :  *  €  S)  and 
P  =  (P(i,j)  :  *,j  6  S),  where  S  =  {0,1,...}  is  the  state  space  of  X.  Then,  if  f[Xo,Xx,...)  = 
f(X0,Xi,...,Xn)  (i.e.  /  depends  on  the  trajectory  of  X  only  up  to  the  deterministic  time 
n),  it  follows  that 

a— i 

£  /(«o,...,«»)m(*o)  nP(“-**+»))-  (2) 

ft . .  k=o 

Let  g :  5“+1  -*  R  be  a  probability  mass  function  (p.m.f.)  on  Sn+l  with  the  property  that 

n—  1 

M(io)  n  >  0  implies  . u)  >  o.  (3) 

hmO 

(e.g.  0(*o,...,*n)  =  A(*'o) . . . h(in)  with  h  a  strictly  positive  p.m.f.  on  S).  By  multiplying  and 
dividing  appropriately  in  (2)  (we  need  (3)  to  do  this),  we  obtain 


m(*o)  II  p(tk  .**+ 1) 

a  ~  )  ■  /(V>l  ■  ••!*!»)  »•  •  \  ?(*0l  •  •  •  I  *»») 

Sl*o.  •••.*») 

•0»  •••»■  » 

n- 1 

p{Xo)YlP[Xk,XM) 

=  E'f{X° . Xn)  g(Xo,...,Xn) 


(4) 


when  Eg{  )  is  the  expectation  over  trajectories  X  having  p.m.f.  g\  formula  (4)  is  the  discrete¬ 
time  Markov  chain  analogue  to  (1). 

By  relation  (4),  a  is  representable  as  the  expectation  of  a  r.v.  U  taken  with  respect 
to  E9(  ).  This  suggests  that  a  may  be  estimated  by  repeatedly  sampling  the  r.v.  X  = 
pTo,...,  Jfn)  from  the  p.m.f.  g  and  forming  a  sample  mean  of  the  observations  Ui  =  U(Xi)  so 
generated.  To  analyze  the  efficiency  of  the  new  estimator,  consider  a  budget  t  representing 


the  amount  of  computer  time  assigned  to  numerical  evaluation  of  a.  Then,  if  ft  is  a  r.v. 
denoting  the  amount  of  computer  time  required  to  generate  Uif 

_  |  n»ax{n  >  1 :  ft  +  •  •  •  +  ft.  <  »}.  ^  4  j5) 

is  the  number  of  tf.’s  generated  by  time  t.  Thus,  the  estimator 

jjj (») 

is  the  estimator  available  after  t  units  of  computational  effort  have  been  expended.  By 
appealing  to  Section  5  of  Glynn  and  Whitt  (1986),  the  following  result  is  easily  proved. 


Theorem  1.  If  Eg(U?  +  ft)  <  oo,  then  -  a)  =>  a(g) JV(o,  l)  as  t  -*  oo,  where  <r'(g)  = 

Et0i  •  var#  Ux. 

Note  that  if  we  choose 

n— 1 

$(*0,..-,*»)  =  M(*o) 

fc=0 

then  the  above  importance  sampling  algorithm  with  g  =  $  reduces  to  the  conventional 
Monte  Carlo  procedure  for  calculating  a.  Thus,  we  obtain  an  improvement  in  efficiency  by 
using  importance  sampling  p.m.f.  g  provided  that  cr^fy)  <  a2  (5). 

Much  of  the  previous  literature  on  importance  sampling  assumes  that  £„ft  is  indepen¬ 
dent  of  g;  in  this  case,  efficiency  is  maximized  by  reducing  the  variance  var„  ft  to  its  minimal 
level.  Clearly,  the  assumption  that  the  expected  computational  effort  Egpx  is  insensitive 
to  g  is  especially  unpalatable  in  this  problem  setting.  In  fact,  we  would  typically  expect 
here  that  Eap j  will  be  a  major  component  of  <r3(y),  particularly  when  g  is  a  joint  p.m.f.  on 
S"+1  from  which  it  is  expensive  to  generate  variates.  However,  it  does  seem  reasonable  to 
expect  that  EgPi  m  Etpi  when  g  is  itself  a  Markov  probability  in  S"+l,  by  which  we  mean 

that  there  exists  an  initial  distribution  v  and  transition  matrices  Kk  such  that 

»- 1 

y(*Oi  •  •  •  .*n)  =  v(»'o)  •  JJ  Kk(*h,ik+l)-  (7) 

hmo 

Of  course,  in  order  that  g  satisfy  (3),  it  is  necessary  that  /.(»)  >  0  imply  v(»)  >  0  and  P(i,j)  >  o 
imply  Kk{i,j)  >  0. 

The  probability  g  is  said  to  be  time-homogeneous  Markov  if  K,  =  K  for  j  >  0.  In  this 
case,  a  =  E/cflX0,...,Xn)LH(P,K),  where  Ek  *  Et  with  g  given  by  (7)  and 


Ln(P,K) 


m(*o)  tt  P(Xk,Xk+l) 
v(Xo)  ^}0K(Xk,Xk^Y 


(8) 
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Note  that  to  generate  a  r.v.  X  from  the  Markov  probability  g  given  by  (7),  we  generate 
Xo  according  to  v  and  recursively  generate  the  remaining  X<’s  from  the  transition  matrices 
Kk-  Thus,  the  effort  required  to  generate  X  under  such  a  Markov  probability  should  be 
roughly  comparable  to  that  required  to  generate  X  under  P.  However,  because  of  the 
particular  simplicity  of  generating  from  time-homogeneous  Markov  probabilities,  we  shall 
restrict  our  attention  to  this  class  of  sampling  “densities”  g  for  the  remainder  of  this  section. 

We  now  turn  our  attention  to  the  class  of  estimation  problems  a  =  EPf(X0, Xlt...) 
for  which  there  exists  no  integer  n  such  that  f(X0,X i,...)  =  f(X0,X lt...,Xn).  This  occurs, 
for  instance,  when  we  wish  to  calculate  a  =  £>7\  when  T  =  i nf{n  >  0  :  Xn  €  .4};  such 
problems  arise,  for  example,  when  it  is  necessary  to  calculate  the  expected  time  until  a 
buffer  overflows  in  a  manufacturing  system.  (Here,  A  is  the  set  of  states  corresponding  to 
a  buffer  overflow.) 

Suppose  that  /(X0,Xi,...)  =  /(Xo . Xr)  where  T  is  a  stopping  time  for  the  process  X. 

(A  stopping  time  is  a  random  time  with  the  property  that  the  event  { T  =  n}  is  determined 
completely  by  the  history  of  (X0l . . . ,  X„);  see  p.  118  of  Qinlar  (1975)  for  details.)  Our  next 
proposition  shows  that  (8)  extends  in  the  natural  way  to  estimating  expectations  of  this 
form. 


Proposition  1.  Suppose  that  EP\f(X0 . XT)|  <  oo  with  T  a  stopping  time.  Then 


where 


EPf(X0,...,XT)  =  EKf(X0 . Xt)Lt(P,K) 


Lt(P,K)  = 


n(X0)Tfrl  P(Xk,  Xk+i) 
•'(Xo)  u  K(x*,xMy 


Proposition  1  shows  that  importance  sampling  extends  easily  to  the  case  where  / 
depends  on  X  up  to  a  stopping  time  T.  Efficiency  increases  are  obtained  by  appropriately 
choosing  «/  and  X,  and  then  simulating  X  under  initial  distribution  v  and  transition  matrix 
K\  Proposition  1  is  then  used  to  construct  the  sample  mean  estimator. 

We  conclude  this  section  with  a  description  of  how  importance  sampling  extends  to 
steady-state  estimation.  We  recall  that  the  steady-state  estimation  problem  requires  cal¬ 
culating  a  =  EP /(X0l  Xi, . . .)  where 

l  "-l 

f(x0,xu...)  =  lim  -2>(**)-  (») 

„_oo  n 
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i.e.  f(X0,Xi,...)  is  the  long-run  average  of  (A(Xn)  :  n  >  0}  for  some  function  h.  A  naive 
approach  to  calculating  a  would  involve  extending  Proposition  1  to  functions  /  defined 
through  (9).  In  other  words,  one  might  hope  that 

i  n~1 

a  =  Ek  lim  L  £  *(Xk)  Ioo(P,  K),  (10) 

n  fc.O 

where  Lco(P,K)  =  limn-oo  Ln(P,K).  However,  it  turns  out  that  (10)  is  false  in  general,  and  so 
we  conclude  that  importance  sampling  can  break  down  for  functionals  /  which  depend  on 
the  infinite  history  of  the  process  X  =  {*»  :  n  >  0}.  Since  this  observation  is  an  important 
one,  we  offer  an  elementary  demonstration  that  (10)  is,  in  general,  false.  (In  the  language 
of  measure-theoretic  probability,  (10)  breaks  down  because  the  measures  on  the  infinite 
path  space  of  X  associated  with  P  and  K  are  mutually  singular;  see  Section  5  for  more 
details.) 

Suppose  that  h  is  strictly  positive  and  that  A-  is  an  irreducible  Markov  chain  on  finite 
state  space  S  under  transition  matrix  P.  Clearly,  a  is  then  positive.  We  shall  see,  however, 
that  the  right-hand  side  of  (10)  vanishes,  unless  P  =  K.  To  prove  this,  it  suffices  to  show 
that  LoolP./f)  =  0  under  P*( ).  (PK( •)  is  the  probability  corresponding  to  £*(•).)  Clearly,  the 
result  is  immediate  if  P(i,j)  vanishes  when  K{t,j)  is  positive,  since  the  irreducibility  and 
finite  state  space  guarantees  that  (A„,Xn+i)  will  eventually  visit  such  an  (i,j)  under  P/c (  ) - 
Otherwise,  observe  that 


L°°lP'Ki  “  [eV*'***'! 

when  <!>(%,  j)  =  log(P(t,j-)/#f (»,»).  But 

A  £  pWiiJWJ) 

n  t?o  i.jes 


(11) 


(12) 


P# c  a.s.,  where  p  =  (p(») : »  e  S)  are  the  stationary  probabilities  of  K.  By  the  strict  concavity 


of  hg(  ), 


52  p(*)^(».i)  log 

t.ies 


=  0 


with  strict  inequality  when  the  points  P[i,j)/K(i,j )  are  distinct  (i.e.  P[i,j)  ^  X(i,j)).  By 
(12),  Ek1  *(**,**♦,)  -  -oo  and  thus  LW(P, K)  =  0  by  (11). 

In  spite  of  the  above  difficulty,  it  turns  out  that  importance  sampling  can  be  applied 
to  the  steady-state  estimation  problem.  The  idea  is  to  use  the  regenerative  structure 
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of  X  to  reduce  the  infinite-horizon  steady-state  behavior  to  a  finite-horizon  analysis  over 
regenerative  cycles.  To  be  precise,  for  a  given  (but  arbitrary)  state  t  €  S,  let  r(»)  = 
inf{n  >  1 :  Xn  =  »},  and  set 

T(0-1 

y=  E  m*.) 

km  o 

E  IM*»)I 

k»0 

r  =  T(i). 

It  is  well  known  that  the  steady-state  mean  of  h  can  be  expressed  as  a  =  EP{Y\X0  =  »  } 
/£/>{r|Xb  =  »'}.  By  observing  that  T  is  a  stopping  time  and  applying  Proposition  1,  we 
obtain  the  following  result. 

Proposition  2.  Suppose  X  is  a  positive  recurrent  irreducible  Markov  chain  under  transi¬ 
tion  matrix  P.  If  EP{?\X0  =  *}  <  oo,  then 

v  1  v  *  _  Ek{YLt{P,K)\X0  =  ,} 

'  JS.  n  £>(  )  * 

A  second  representation  for  the  steady-state  mean  a  can  also  be  derived.  Observe  that 
K  is  expressed  as  a  sum,  and  note  that 

{eW(™} = p*  ■  g  }  ■ 

It  is  easily  verified  that 

*{t**8H . *1- 

and  hence,  it  follows  by  conditioning  on  X0,  ...,Xt  that 

Ek  |EM^)^n(P,A)J  =  Ek  jx>(*<)MP.*)} . 

The  above  argument  easily  extends,  as  in  the  proof  of  Proposition  1,  to  stopping  times  T. 
Applying  the  stopping  time  version  to  the  regenerative  representation  of  the  steady-state 
mean  yields  the  following  result. 

Proposition  3.  Suppose  A  is  a  positive  recurrent  irreducible  Markov  chain  under  transi¬ 
tion  matrix  P.  If  EP{t\X0  =  t}  <  oo,  then 


•••••**  j 


Ep  Um  -EM*») - - T1 

hm°  Ek  K)\Xq  = « j 


Aa  a  consequence  of  of  Propositions  2  and  3,  we  see  that  the  steady-state  mean  a  can 
be  expressed  as  the  ratio  of  two  expectations;  standard  methods  can  then  be  applied  to 
estimate  a.  In  particular,  suppose  {((Vi),,,  (r L)n)  :  n  >  1}  is  a  sequence  of  i.i.d.  replicates 
of  (YLr(P,K),  tLt(P,K))  generated  under  Px(  ).  Then,  the  estimator  ri(t)  for  a  is  the  ratio 
estimator,  available  after  t  time  units  of  computation  have  been  expended,  of  the  sample 
mean  of  (YL)n' s  to  (»\L)n’s.  Similarly,  the  ratio  estimator  r2(t)  can  be  constructed  from 
Proposition  3.  Let  0(1),  0(2)  be  the  computation  time  required  to  generate  the  random 
vectors  (YLr(P,K),  tLt(P, A)), (E^o  h(Xt)Le(P, K), Lt(P, K)),  respectively,  under  PK(). 
The  following  result,  which  follows  immediately  from  Section  5  of  Glynn  and  Whitt  (1986), 
describes  the  rates  of  convergence  of  the  two  steady-state  estimators. 


Theorem  2.  Suppose  X  is  a  positive  recurrent  irreducible  Markov  chain  under  transition 
matrix  P.  If  EK{(?2  +  r2)L2 \X0  = »}  <  oo  and  EK(0(  1)  +  0(2))  <  oo,  then 


tI/a(r,(t) -<*)=►*  JV(O.l) 

as  t  -» oo,  for  t  =  1, 2,  where 

a?  =  EK(0(1)\XO  =  i)  Ek{(Y  -  ar)*L*t{P,K) \X0  =  *}/(Ek{tLt(P,K)\X0  =  i})2 


<rl  =  EK{0(2)\Xo  =  i) 


As  in  Theorem  1,  confidence  intervals  for  a  follow  easily  from  this  central  limit  theorem. 
Theorem  2  raises  the  question  of  which  of  the  two  steady-state  estimators  is  more  efficient. 
Note  that  the  proofs  of  Proposition  2  and  3  show  that 


Ek  | 

(E(M^)-c)Mp.a-)) 

\fc«0  / 

1*0  =.j 

(e„\ 

(r-l 

l«=0 

»• 

Ek{tLt(P,K)\Xo  =x}  =  Ek  |2  Lt(P,  K)\Xo  =  .  J  = 


EP(r\X 0  = »}. 


Thus,  if  0(1)  is  approximately  equal  to  0(2)  (which  seems  reasonable  in  most  applications), 
r3(t)  is  more  efficient  than  ri(t)  if  (and  only  if) 

E K  |  (l>(*<)  -  a)Lt(P,K)^j  \X0  =  .  J  <  Ek  |  ^T>(**)  -  «)j  L*r(P,K)\X0  =  *  j  .  (13) 

The  choice  of  return  state  *  is  another  natural  question  which  arises  in  the  regenerative 
context.  If  one  assumes  that  the  computational  effort  equals  r  (this  is  a  common  assump¬ 
tion  in  regenerative  analysis),  then  it  is  well  known  (see  Crane  and  Iglehart  (1975))  that 


the  choice  of  return  state  does  not  affect  the  efficiency  parameter  a1  when  conven¬ 
tional  regenerative  output  analysis  is  used.  On  the  other  hand,  the  choice  of  return 
state  t  does  have  an  effect  on  the  asymptotic  efficiency  in  the  importance  sampling 
case  currently  under  study  here;  an  example  illustrating  this  appears  in  Section  9. 
Although  we  have  no  mathematical  theory  to  guide  the  choice  of  return  state,  it 
seems  reasonable  to  employ  the  heuristic  of  choosing  that  state  i  with  the  shortest 
expected  return  time  under  Px(). 

3.  IMPORTANCE  SAMPLING  FOR  CONTINUOUS- TIME  MARKOV 
CHAINS 

In  this  section,  we  show  how  to  extend  the  ideas  of  Section  2  to  continuous-time 
Markov  chains.  To  be  precise,  given  a  non-explosive  continuous-time  Markov  chain 
X  =  (X(t) :  t  >  0}  with  (conservative)  generator  Q  and  initial  distribution  #i,  our  goal 
is  to  apply  importance  sampling  to  the  estimation  of 

«  =  EqHX), 

where  /  is  real- valued.  (Eq(  )  corresponds  to  the  expectation  on  the  path-space  of 
X  associated  with  using  generator  Q,  when  initialized  with  m). 

As  in  Section  2,  it  is  possible  to  do  importance  sampling  in  which  the  sampling 
distribution  does  not  correspond  to  a  continuous-time  Markov  chain.  However,  as 
argued  in  Section  2,  it  seems  reasonable  to  restrict  ourselves  to  using  sampling 
distributions  which  are  themselves  Markov,  and  we  shall  do  so  here. 

The  simulation  of  a  continuous-time  Markov  chain  generally  proceeds  on  the 
time  scale  of  state  transitions.  By  this,  we  mean  that  any  event-driven  algorithm 
for  generating  the  sample  path  of  such  a  process  will  evolve  iteratively  by  determin¬ 
ing  the  “next”  state  and  corresponding  holding  time  for  that  state.  Thus,  after  n 
iterations  of  the  algorithm,  we  have  simulated  the  process  up  to  time  S~ ,  where  Sn  is 
the  instant  at  which  the  chain  enters  the  (n  +  l)’st  state.  In  other  words,  the  natural 
time  scale  for  discrete-event  simulation  of  such  a  chain  is  the  time  scale  correspond¬ 
ing  to  {S„  :  n  >  0}.  As  a  consequence,  we  shall  initially  concentrate  on  studying 
estimation  problems  for  which  }(X)  can  be  expressed  as  /(A(t) :  0  <  t  <  S„)  for  some 
n  >  0.  Note  that  such  a  function  /  can  be  re-expressed  as  /(t/o,Vb,l/i)Vl...,C/n,V,,l), 
where  t/<  is  the  »’th  state  visited  by  the  chain  and  is  the  time  spent  in  that  state. 


8 


Recall  that  {Un  :  n  >  0}  is  a  discrete-time  Markov  chain  with  transition  matrix  R  defined 
by  R(i,j)  =  Q(i,j)/q{*)  for  *  *  j  and  =  0,  where  9(»)  =  -<?(*,*)•  Furthermore,  conditional 
on  {Un  :  n  >  o>,  the  V„’s  are  independent  exponential  r.v.’s  for  which  the  (conditional)  mean 
of  Vn  is  l/q(Un).  From  these  observations,  it  is  immediate  that 


a  =  EQf(X(t)  :  0  <  t  <  Sn) 

=  Eq/(Uo,Vo . Un,Vn) 

roo  rca  n~1  * 

~  ^2  /(«0. vo . ««, «»)m(«o)  •  n  *(«#• «y+i)  II «p(-g(u,>, )dv° ...dvn 

. . U.J°  Jo  J=0  j=0 

»  \ 

-  9(uy)wy  j  rfuo,  ...,dvn. 

}=o  ; 

(14) 

Suppose  that  we  wish  to  estimate  a  by  simulating  a  chain  with  a  different  initial  distribution 
1  and/or  generator  A.  We  assume  that  A  is  non-explosive  and  conservative  and  that  the 
analogue  of  condition  (3)  holds: 


fOO  0OO  / 

=  V  /  /  /(t4o,Vo . «n,%M“o)n<5(1']'tt>>l)«P 

. . «.-/°  •'<’  y=o  V 


p(t)  >  0  implies  t(»)  >  0  and 
<?(»,/)  5^  0  implies  A(i,j)  ±  0. 


(15) 


From  (14),  it  is  evident  that  an  appropriate  multiplication  and  division  yields 

«  =  EAf[U0,V i, . Un,Vn)Ln(Q,A) 

—  EAf(X(t)  :  0  <  t  <  Sn)Ln(Q,  A) 


where 

-)  -  g  n  -»  (/><*<*»  -  **!•»*)  t“l 

and  a(t)  =  —  j4(*,»).  Hence,  a  can  be  estimated  via  importance  sampling.  We  simulate  the 
chain  under  generator  A,  as  initialized  by  t,  up  to  time  S„.  At  the  end  of  the  run,  the 
quantity  f(X(t) :  0  <  t  <  Sn)Ln(Q,  A)  is  calculated.  By  independently  replicating  these  runs 
many  times,  we  can  estimate  a  by  a  sample  mean  of  the  /(X(t)  :  0  <  t  <  Sn)Ln(Q,AYs  so 
obtained. 

As  in  Section  2,  we  now  wish  to  extend  our  importance  sampling  methodology  to 
more  general  functions  }(X).  Suppose  that  T  is  a  stopping  time  relative  to  the  sequence 
{ ( t/* ,  V* )  :  *  >  0},  by  which  we  mean  that  the  event  {r  =  n}  is  completely  determined  by 
(t/o,V0),...,(t/„,Kn).  The  next  result  follows  from  (16)  in  a  manner  similar  to  the  way  in 
which  Proposition  1  followed  from  (8). 
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Theorem  3.  If  :  0  <  t  <  Sr)|  <  oo  with  T  a  (finite-valued)  stopping  time,  then 

EQf(X(t)  : 0  <  t  <  ST)  =  EAf{X(t)  :0<t<  St)Lt(Q ,  A) 


where 


MQ.A) 


M(Pb)  Vt 

•»(%)  ,y0 


MUj>uj+i) 


exp 


(«(*M)  -  «(*(»)))<* 


An  important  application  of  this  stopping  time  result  occurs  when  one  is  required  to 
estimate  the  transient  behavior  of  the  chain  X  up  to  a  deterministic  time  w.  In  other 
words,  suppose  that  f(X)  =  /( JC(t)  :  0  <  t  <  to),  where  to  is  deterministic.  To  simulate  the 
process  X  up  to  time  to  requires  T(w )  iterations  of  the  event-scheduling  algorithm,  where 
T(to)  =  inf  {n  >  0  :  S0  +  •  +  Sn  >  to}.  Clearly,  f(X(t )  :  0  <  t  <  to)  =-  /(X(t)  :  0  <  t  <  ST{w))  W.p.l. 

(Equality  may  not  hold  if  Sr(w)  =  to,  but  this  happens  with  zero  probability.),  and  hence 


Eqf(X(t)  :  0  <  t  <  to)  =  EAf(X(t)  :  0  <  t  <  w)Lr{w)(Q,  A). 


Formulas  similar  to  this  appear  on  p.  166  of  Bremaud  (1981),  where  they  were  derived 
using  martingale  representations  of  continuous-time  Markov  chains. 

We  turn  now  to  the  steady-state  analysis  of  continuous-time  chains.  As  in  the  discrete- 
time  setting  of  Section  2,  we  exploit  regenerative  structure.  For  a  given  (arbitrary)  state 
in  5,  let  r  =  inf{n  >  1  :  Un  =  »}, 

rSr 

Y=  /  h(X{t))<U 

Y  =  [  '  |h(*(»))|*. 

Jo 

An  analysis  similar  to  that  used  to  obtain  Proposition  2  shows  that  if  X  is  irreducible 
positive  recurrent  and  if  Eq{? |Jf(0)  =  *}  <  oo,  then 


(17) 


Although  (17)  could  be  used  directly  to  obtain  the  steady-state  means,  such  an  algorithm 
would  be  inefficient.  In  particular,  one  can  apply  conditional  Monte  Carlo,  in  the  form 
of  discrete-time  conversion  (see  also  Fox  and  Glynn  (1986)),  to  (17),  thereby  yielding  the 
expression 


E„  lim  -  f  >»(*(*))<k  = 

t  — oo  t  Jo 


( *=0  JmO 

Ea  (  TTuTT  II  r\Uu';Uu’*+\\  lx(°)  =  1 
;*0 


(18) 
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where  =  A(i,j)/a(i)  for  *  /  j  and  #(*,*)  =  0.  The  principle  of  conditional  Monte  Carlo 
guarantees  that  a  ratio  estimator  based  on  (18)  will  have  a  lower  variance  than  one  based 
on  (17);  a  central  limit  theorem  for  this  estimator  similar  to  that  appearing  in  Theorem  2 
can  easily  be  derived.  An  alternative  derivation  of  (18)  rests  on  the  well-known  fact  that 


n—  1 


EqM-  f*  *(*(»))*  = 

t— *oo  t  J o 


n—1 


n— *oo  _ 


k-0 


Proposition  2  can  then  be  applied  to  the  numerator  and  denominator  of  the  right-hand 
side  to  obtain  (18). 

To  conclude  this  section,  we  note  that  the  r.v.’s  appearing  on  the  right-hand  side 
depend  only  on  the  embedded  discrete-time  chain  { Un  :  n  >  o);  this  chain  can  be  simulated 
under  PA(  )  by  using  the  transition  matrix  R'.  Thus,  if  (18)  is  used  to  estimate  the  steady- 
state  mean,  no  exponential  variates  need  be  generated. 


4.  IMPORTANCE  SAMPLING  FOR  GENERALIZED  SEMI-MARKOV 
PROCESSES 

In  the  preceding  two  sections,  we  have  shown  how  importance  sampling  applies  to 
discrete  and  continuous-time  Markov  chains.  We  shall  now  show  how  importance  sampling 
applies  to  general  discrete-event  simulations.  As  in  Section  3,  we  shall  restrict  attention 
to  sampling  distributions  which  themselves  correspond  to  discrete-event  simulations. 

A  mathematical  framework  is  convenient  for  the  purposes  of  this  discussion.  In  particu¬ 
lar,  we  shall  model  a  discrete-event  system  as  a  generalized  semi-Markov  process  (GSMP). 
As  described  in  Glynn  (1983),  a  GSMP  is  basically  a  mathematical  formalization  of  a 
discrete-event  simulation;  the  process  is  driven  by  an  event-scheduling  algorithm,  in  ex¬ 
actly  the  same  way  as  is  a  discrete-event  system. 

To  be  precise,  let  5  be  a  subset  of  the  non-negative  integers  representing  the  state 
space  of  the  GSMP.  The  GSMP  also  requires  specification  of  a  second  finite  integer-valued 
set  E ;  E  corresponds  to  the  set  of  all  possible  events  that  can  initiate  a  state  transition. 
For  example,  in  the  single-server  queue,  E  consists  of  two  events,  one  each  for  the  arrival 
and  departure  processes. 

To  understand  the  probabilistic  dynamics  of  a  GSMP,  recall  that  in  order  to  simulate 
a  discrete-event  system,  we  need  to  schedule  each  event  possible  in  the  state  currently 
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occupied;  we  can  view  the  event  schedule  as  being  represented  by  a  set  of  clocks.  So,  for 
each  event  a  6  E(a)  (the  set  of  events  active  in  state  «  €  S),  suppose  that  r(«,«)  is  the  non¬ 
negative  rate  at  which  the  clock  corresponding  to  event  e  runs  down  in  state  «;  we  assume 
that  max{r(«, e)  :  e  €  E{a)}  >  0  for  each  state  a  6  S  (i.e.  at  least  one  clock  has  a  positive 
speed)  and  max{r(«,e)  «  €  £(«),*  €  S)  <  oo.  Under  this  assumption,  a  clock  will  eventually 
run  down  to  zero,  thereby  initiating  a  state  transition.  If  two  or  more  clocks  run  down  to 
zero  simultaneously,  the  clock  with  the  minimal  index  «  is  regarded  as  the  “trigger”  event. 
The  next  state  $'  is  then  chosen  with  probability  p(a';«,e*),  where  a  was  the  previous  state 
and  event  «*  6  F(a)  was  the  event  that  triggered  the  transition  there.  In  state  a',  the  clocks 
in  state  a  that  are  still  active  in  a'  continue  to  run  down  (now  according  to  rates  *•(«',  e)); 
the  set  of  these  “old”  clocks  is  the  set  0(a\a,e‘)  =  £(a')n(£(a)~  {«*}).  In  general,  when  state 
4*  is  entered,  new  events  (represented  by  the  set  N(a',a,e*)  =  E(a')  -0(a',a,e*))  need  to  be 
scheduled.  The  clock  on  event  «  e  N(a',a,e‘)  is  set  (independently  of  the  past)  according 
to  a  probability  distribution  F( ;  a',  e,  «,«*);  we  require  that  F(0;a',e,  a,e*)  =  0  (clocks  are  set 
at  strictly  positive  values).  The  clocks  in  £(4')  are  then  allowed  to  run  down,  and  another 
state  transition  occurs;  the  above  process  is  then  repeated  indefinitely.  The  state  occupied 
at  time  t  will  be  denoted  as  X(t)]  the  process  :  t  >  0}  is  then  the  desired  GSMP.  It 
should  be  clear  that  the  above  description  of  a  GSMP  closely  mimics  the  evolution  of  a 
discrete-event  system.  For  more  details  on  the  specification  of  a  GSMP,  see  Burman  (1981) 
or  Whitt  (1980). 

As  in  Section  3,  it  is  convenient  to  view  the  GSMP  on  a  time-scale  which  is  natural  to 
the  simulator,  namely  {S„  :  n  >  0},  where  Sn  is  the  instant  at  which  the  GSMP  enters  the 
(n+  i)’st  state  visited.  Let  {(An,Cn) :  n  >  0}  be  the  sequence  in  which  An  represents  the  n’th 
state  visited  by  the  GSMP  and  Cn  -  (C„(e)  :  e  e  E)  is  the  vector  of  clock  readings  at  the 
instant  of  entry  into  state  An.  (If  a  clock  is  inactive  in  state  An,  the  clock  is  set  to  +00.) 
Clearly,  the  trigger  event  emn  in  An  is  a  function  of  An  and  Cn  namely 

«;  =  min{«'  €  E  :  Cn(e')/r{An, «')  =  min{Cn{e)/r(An,  e)  :  e  6  E}}. 

Note  that  5*  =  ££«0c*(*D/rM*>*!)-  A  little  thought  then  shows  that  if  f[X{t)  :  t  >  0)  = 
f(X(t) :  0  <  t  <  Sn),  /  can  be  represented  as  f(Ao,C0,...,An,Cn).  A  further  simplification  is 
possible,  however.  Observe  that  C„  =  en(A0,C0,---,An,cn)  (with  c„()  deterministic),  where 
Ci  »  (C<(«)  :  a  6  E)  is  the  vector  of  “new"  clock  readings  generated  at  the  n’th  transition, 
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namely 


So 


/  Co(«),  «  €  E(Ao) 
\  +oo,  e  £  E(Ao) 


and 


_/^«(e)i  *  €  N(An,  An-i,  j ) 

~  -l+oo, 

for  n  >  l.  We  conclude  that  there  exists  a  (deterministic)  function  /  such  that 
f(X(t)  :  0  <<<$„)  =  /Mo,C’o,.-.,i4»1Cn).  This  representation  of  /  will  now  be  used  to 
develop  importance  sampling  methodology  for  a  GSMP  estimation  problem  of  the  form 


a  =  EFf(X(t)  :  0  <t<Sn).  (19) 

(Ef(  )  is  the  expectation  on  the  path  space  of  the  GSMP  associated  with  transition  probabil¬ 
ity  p(a';  »,«*),  clock-setting  distributions  /’(•;  a', e,  «,«*),  and  initial  distribution  p;  we  assume 
that  the  initial  distribution  takes  the  form 


P{Ao  =  *,Co(«)  e  dx,,e  S  2?(a)}  =  p(a)  ■  F(dxe,e). 

e€S(.) 

As  one  might  expect  from  our  discussion  of  discrete  and  continuous-time  chains,  an 
analogue  to  conditions  (3)  and  (15)  will  be  needed.  Suppose  that  we  wish  to  use  a  sampling 
distribution  which  corresponds  to  a  GSMP  having  identical  state  space  S ,  event  space 
E,  rates  r(3,e),  but  possessing  (possibly)  different  transition  probabilities  g(s',e,e*),  clock¬ 
setting  distributions  H(-\a',e,s,  «*)  and  initial  distribution  7  (7  is  characterized  by  initial 
state  probability  7(a)  and  initial  clock-setting  distributions  H[dxc,e).)  The  data  of  the  new 
GSMP  must  satisfy: 

1. )  p(a';a,e*)  >  0  implies  g(a';a,«*)  >  o  and  p(a)  >  0  implies  7(a)  >  0. 

2. )  there  exists  functions  Jfc(  ;a',«,a,e*)  and  jfc( ,«)  such  that 

E(dx;  a',e,  a,e*)  =  fc(x;  a',  e,  a,  em)H(dx;  s',  e,  a,  «*) 


F(dx,e)  =  k(z,e)H(dx,e). 


Since  the  first  condition  is  self-explanatory,  we  concentrate  on  explaining  the  second  con¬ 
dition  above.  It  basically  says  that  wherever  H  assigns  no  mass  (i.e.  H{dx)  =  0),  F  must 
assign  no  mass  ( F(dx )  =  0).  In  the  language  of  mathematical  probability,  F  is  said  to  be 
absolutely  continuous  with  respect  to  H .  To  gain  a  better  understanding  of  this  condition, 


suppose  that  F  and  H  both  have  densities  that  are  positive  everywhere.  Then 

*(x;«'e,»,e*)  = 

*(*.«)  =  f(x,e)/h(x,e). 
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However,  F  need  not  have  a  density  in  order  that  condition  2)  hold.  In  particular,  suppose 

that  F  assigns  probability  /*  to  the  point  **  for  k  =  1,2 .  If  H  assigns  mass  hk  to  the 

same  sequence,  then 

*  i  *  )  =  fk/^k- 

Mixed  probability  distributions,  that  contain  both  point  masses  and  densities,  can  be 
similarly  handled. 

We  turn  now  to  the  application  of  importance  sampling  to  the  estimation  problem 
(19).  As  argued  previously,  f(X{t)  :0  <t<Sn)  =  f(A0,Co,---,AnCn),  so 


ot  —  Ef  f  (Ao,  Coi  •  •  • ,  Cn) 

*  H  J  j  /(“<>-£ o.---.aB,£n)M(ao)  n  F(d£o(e),e) 

n  —  1  n 

•  n  p(aJ+i"°y>ey)n  n  j- 

By  multiplying  and  dividing  appropriately  in  the  above  expression  (we  need  1)  and  2) 
above  to  avoid  dividing  by  zero),  we  get 


<*  —  Eh  /(Ao ,  Co . An,  Cn)£n(-^' >  H) 

=  EHf{X(t):0<t<Sn)Ln{F,H) 


where 


Ln  (F,H) 


n(A  o) 
~l{A  o) 

n 


n 

«6E(Xo) 


fc  (Co  («)>«) 


n 


3=0 


p(Ay+i!  A],  e*) 

?(Ay+i;  A],  e}  ) 


(20) 


•II  II  *(C’i(e);A„«lAy.l,.;_l) 

and  j&H (•)  is  the  expectation  on  the  path  space  of  X  associated  with  the  sampling  distribu¬ 
tion  described  above.  The  extension  from  deterministic  n  to  stopping  times  T  (i.e.  {T  =  n} 

is  completely  determined  by  (i40)Co . An,cn)-)  is  straightforward,  thereby  yielding  our 

next  theorem. 


Theorem  4.  If  Er\f[X[t) :  0  <  t  <  Sr)|  <  oo  with  T  a  (finite-valued)  stopping  time,  then 
EFf(X{t)  :0<t<ST)  =  EHf(X{t)  :  0  <  t  <  St)Lt{F,H) 
where  Lt{F,H)  is  the  r.v.  obtained  by  substituting  T  into  (20). 
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As  in  the  continuous-time  chain  setting,  the  most  important  application  of  the  stopping 
time  version  described  above  is  to  functions  /  of  the  form  }(X)  =  f{X(t)  :  0  <  t  <  w).  Set 
T(w)  =  iaf{n  >  0 :  Sn  >  «/},  and  note  that  T(w)  is  a  stopping  time.  It  follows  that 


£rf(X(t)  :0<t<w)  =  EHf(X{t)  :  0  <  t  <  w)Lt{w)(F,H). 


Thus,  to  apply  importance  sampling  to  computation  of  expectations  of  the  form  a  = 
EFf(X(t)  :  0  <  t  <  u>),  it  suffices  to  simulate  the  GSMP  under  PH()  up  to  the  T(ui)’th 
transition.  From  the  path  so  generated,  one  can  easily  recursively  compute  Lt^(F,H), 
thereby  yielding  the  observation  f(X(t )  :0  <  t  <  w)LT(W)(F,  H).  A  sample  mean  consisting  of 
independent  replicates  of  the  chain  is  then  a  consistent  estimator  of  a. 

We  turn  now  to  a  discussion  of  the  application  of  importance  sampling  to  steady-state 
estimation  of  discrete-event  simulations.  To  be  precise,  suppose  that 


f(X)=  lm.  7 

t— »oo  t  Jo 


where  X  is  a  GSMP.  As  discussed  in  Section  2  for  discrete  time  chains,  naive  importance 
sampling  breaks  down  ,  in  the  sense  that  if  one  sets  T  =  oo  in  Theorem  4,  the  result 
becomes  false.  In  fact,  we  have  been  able  to  obtain  an  importance  sampling  estimator  for 
the  steady-state  mean  of  a  GSMP  only  when  the  GSMP  is  regenerative. 

To  obtain  regenerative  structure,  we  assume  that  the  GSMP  hits  a  single  state  infinitely 
often,  by  which  we  mean  that  there  exists  *0,si  6  5,  e0  e  E(s0)  such  that: 

a)  A„+1)  =  0>o,eo,  »i)  infinitely  often}  =  l. 

b)  N(ai,  so,  «o)  =  E(si). 

The  term  “single  state”  refers  to  the  fact  that  if  E(a0)  =  {«o},  then  condition  b)  is  automati¬ 
cally  fulfilled.  By  b),  all  the  clocks  in  E(s i)  are  reset,  independently  of  the  past,  whenever  3! 
is  entered  from  s0  by  the  triggering  of  event  e0.  If  r  =  inf{n  >  l :  =  (soi*o.«i)}> 

it  is  then  immediate  that  ST  is  a  regeneration  time  for  X( ).  Note  that  the  single-server 
queue  is  a  GSMP  with  single  set  (a0  =  0,  e0  =  arrival  event,  «i  =  l),  so  that  the  regenerative 
structure  of  the  G/G/ 1  queue  follows  as  a  special  case  of  this  discussion. 

Letting  Y  =  f^r  h(X(s))ds,  Y  =  /0Sr  |k{X(«))|cfc,  standard  regenerative  analysis  shows  that 
if  EpY  <  oo,  then 


Ep  lim 

t  —+  oo 


EffYLr(F,  H) 

eHstLt(F,H) 


(22) 
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where  Eg  is  the  expectation  on  path  space  in  which  the  initial  distribution  is  given  by 
v(*i)  =  1,  £T( ,«)  =  ;*i,«,*o,«o). 

In  fact,  a  second  ratio  formula  for  a  involving  the  sampling  distribution  H  can  be 
derived;  this  formula  is  the  GSMP  analogue  of  the  discrete-time  chain  result  obtained  in 
Proposition  3.  Observe  that  on  the  event  (r  >  l). 


p(Aj+ 1  ;■*>.«?) 
q(Aj+  i‘,A},e‘) 


n 


^(^>+i  (*);  ^>+1  •  i  ej)\Ao>  Co  i  •  •  • » Af,  Ci 


=  l 


Since  Y  can  be  expressed  as 


[ S'h(X(*))ds  =  Y,h(Ai)Ct(e\)lr(At,e\)t 

ImO 


it  follows,  by  conditioning  the  *’th  summand  of  YLt(F,H )  on  A0,Co,-  ■■  ,At,Ce,  that 


E„{YLT(F,H)}  =  Eg  | ^M^)Ct(«J)L<(F,ir)/r(i4ei«;)J.  (23) 

By  setting  h  =  1  in  (23),  a  similar  formula  for  EStLt(f,H)  is  obtained.  Thus,  if  £^(K  +  Sr)  < 
oo ,  we  find  that 


EH\^h(At)Ct{c-t)Lt{F,H)/r(Atyt)\ 

Er  fim  i J  h(X(s))ds  =  - -1  (24) 

EH\pWt)Lt(F,H)lr(At,e\)\ 

As  a  consequence  of  the  ratio  formulas  (22)  and  (24),  it  is  clear  that  the  steady-state  mean 
a  can  be  estimated  by  simulating  replicates  of  either 


(YLr(F,H),  SrLr(F,H)) 
or 

r-l  t —  1 

Yth{A')Cte)Lt{F,H)/r{At,t\),  £  Ct(e\)Lt(F,  H)/r(At,e-t) 

«*o  <=o 

under  the  sampling  distribution  PgO-  Let  r^t),  jr3 (e)  be  the  two  corresponding  ratio 
estimators  that  are  available  after  t  units  of  computational  effort  have  been  expended, 
and  suppose  0(1), 0(2)  are  r.v.’s  denoting  the  computational  effort  required  for  each  ob¬ 
servational  pair.  By  applying  the  methods  of  Section  5  of  Glynn  and  Whitt  (1986),  the 
following  central  limit  theorems  follow. 
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Theorem  5.  If  EH(fi(l)  +  0{2)  +  (K2  +  S2)L2(F,  H))  <  oo,  then 


t1,2{Zi{t) -*)=*■  Z*N{°<  1) 


as  <  -» oo,  for  i  =  1, 2,  where 


2? 


Eff{(Y-aSr)’mF,H )} 

Efffi(i)  c  r  . 


2§  =  2?tf£(2)  E„  |  -  «)Ct(«2)I*(P.  H)/r(At,  el) )  f  /(EftSrLT(F,  H))2 


As  in  the  discrete-time  Markov  chain  case,  the  denominators  of  the  above  expressions 
for  <t2  and  a\  are  both  equal  to  (EFST)2.  Then,  if  Ejjp(i)  «  Ejjp(2),  13  smaller  than 
if  the  remaining  expression  in  the  numerator  of  a\  is  smaller  than  the  corresponding 
expression  in  £?•  Again,  as  in  the  discrete-time  chain  case,  we  have  no  sufficient  condition 
guaranteeing  that  £2  <  crj. 

We  conclude  this  section  by  illustrating  that  the  estimators  of  this  section  are  generally 
easy  to  calculate.  If  />(•),  PH(  )  correspond  to  G/G/l  queues  (with  identical  initial  distribu¬ 
tions)  having  interarrival  distributions  FA,HA  and  service  distributions  Fs,Hs  respectively, 
then 

Ln{F,H)=  I]  MC,(1))  II  ^(Cy(2)) 

j€A.  ;€A'. 

where  A„  =  {j  :  j  <  n,Sj  is  an  arrival  epoch}.  A*  =  -  An,  and  fcA(  ),  ks()  are  the 

functions  appearing  in  condition  2)  of  this  section  (kA  for  the  arrival  event,  ks  for  the 
departure  event).  Hence, 


in 


L„_l(/’1^)A5(C7n(2)), 


n  €  An 
n  6A‘. 


This  simple  recursive  structure  of  Ln{F,H)  is  typical  of  GSMP’s. 


5.  IMPORTANCE  SAMPLING:  THE  GENERAL  THEORY 

In  the  previous  three  sections,  we  have  shown  how  to  apply  importance  sampling  to 
a  broad  class  of  stochastic  simulations.  The  idea  is  to  find  a  sampling  distribution  from 
which  variates  can  be  cheaply  generated  and  for  which  the  variance  of  the  estimator  is 
reduced.  In  this  section,  we  abstract  the  ideas  of  the  previous  sections  with  the  hope  that 
the  basic  concepts  underlying  importance  sampling  will  become  more  transparent. 
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Given  a  sample  space  0  (with  corresponding  afield  T),  suppose  X  is  a  real-valued  r.v. 
defined  on  fl  for  which  we  are  required  to  calculate 


a  =  EpX 


(25) 


where  EP(  )  corresponds  to  the  expectation  on  Cl  associated  with  the  probability  distribution 
P.  To  define  an  appropriate  sampling  distribution  Q  for  the  estimation  problem  (25),  we 
need  an  analogue  to  the  conditions  above  given  by  (3)  and  (15)  (as  well  as  1)  and  2)  of 
Section  4).  It  turns  out  that  the  abstract  form  of  these  conditions  is: 


P(A)  >  0  implies  Q(A)  >  0 
for  every  A  e  7. 


(26) 


In  the  language  of  mathematical  probability,  (26)  states  that  P  is  absolutely  continuous 
with  respect  to  Q.  The  Radon-Nikodym  theorem  then  guarantees  the  existence  of  a  “den¬ 
sity”  Lq  such  that 

P(A)  =  j  Lq(w)Q(<L>), 
from  which  it  follows  that  if  EP\X\  <  oo, 


EPX  =  EqXLq.  (27) 

Lq  is  called  the  Radon-Nikodym  derivative  of  P  with  respect  to  Q;  in  the  language  of 
statistics,  L  is  known  as  the  likelihood  ratio  of  P  with  respect  to  Q  (hence,  our  use  of  the 
notation  L).  The  derivative  Lq  appears  in  all  the  main  formulas  of  Sections  2  through  4 
(e.g.  (8),  (16),  (20)),  in  the  form  of  the  r.v.’s  Ln,Lr ,  etc.  appearing  there. 

Although  condition  (26)  was  the  one  used  to  construct  sampling  distributions  in  Sec¬ 
tions  2  through  4,  it  turns  out  that  importance  sampling  can  be  generalized  somewhat 
beyond  (26).  Assume  that  Q  satisfies: 

E/»|X|/(A)  >  0  implies  Q(A)  >  0 

(28) 

for  every  A  e  f. 

Since  the  set  of  A’s  for  which  EP\X\I(A)  >  0  is  contained  in  the  set  for  which  P(A)  >  0,  it 
follows  that  (28)  is  a  weaker  condition  in  Q  then  (26).  By  the  Radon-Nikodym  theorem, 
(28)  suffices  to  guarantee  existence  of  L‘Q  such  that 

EpXI(A)  =  Lq{<t))Q(<Lj).  (29) 


18 


In  this  case,  a  =  EqLq.  Here,  Lq  has  no  interpretation  as  a  likelihood  ratio. 

To  compute  the  efficiency  of  an  importance  sampling  estimator  based  on  (29),  we  let  0 
denote  a  r.v.  representing  the  computation  time  required  to  calculate  Lq  under  sampling 
distribution  Q.  If  r(t)  is  the  estimator  available  after  t  units  of  effort  have  been  expended 
(see  (5)  and  (6)  for  an  explicit  description),  then  Eq(/3  +  Lq)  <  oo  implies  that 

^(rfO-al^^Wl) 

where  a2(Q)  =  EQp  varg  Lq.  Ideally,  we  would  like  to  choose  Q  to  minimize  cr2(Q).  This 
problem  is  difficult  to  solve  mathematically  since  the  quantity  EqP  depends  not  only  on  the 
“complexity”  of  the  sampling  distribution  Q  but  also  on  the  efficiency  of  the  programming 
implementation.  As  a  result,  most  analyses  of  this  problem  disregard  the  EqP  term,  and 
concentrate  only  on  reducing  the  variance  varg  Lq. 

To  minimize  varg  Lq  subject  to  (29) ,  observe  that  the  Cauchy-Schwartz  inequality  yields 

wqLq  =  Eq(Lq)2  -  a2 

>(Eq\L'q I)2 -a2 
=  EQ(LQY-o?  =  ™QL'Q 

where  <J(A)  =  EP\X\I(A)/EP\X\.  Hence,  Q  is  the  sampling  distribution  which  minimizes 
varg  Lq.  This  result  is  well  known  in  the  case  where  tl  =  St  (see,  for  example,  Rubinstein 
(1981),  p.  122),  where  it  is  often  pointed  out  that  Q  is  not  implementable  practically 
since  its  construction  requires  knowledge  of  EP\X\,  which  is  generally  unknown.  The  same 
proviso  holds  here,  of  course.  Instead,  one  can  exploit  this  result  by  choosing  Q  to  minimize 
Q  as  closely  as  possible.  Hence,  the  sampling  distribution  Q  should  inflate  the  probability 
mass  assigned  by  P  where  |Aj  is  large,  and  deflate  it  where  |Aj  is  small.  Accomplishing  this 
in  practice  is  hard,  however,  and  must  generally  be  guided  by  sophisticated  theory  (see 
the  “large  deviations”  applications  of  Section  7). 

The  form  of  the  optimal  Q  does  provide  us  with  some  limited  information  of  general 
applicability.  Specifically,  Q(A)  is  positive  only  when  EP\X\I(A)  >  0.  This  implies  that 
if  P(i,j)  =  0,  then  any  Markov  probability  used  for  importance  sampling  should  satisfy 

Before  proceeding,  it  is  worth  noting  that  (27)  can  be  viewed  in  a  somewhat  different 
way.  Given  a  sampling  distribution  Q,  (27)  shows  that  for  every  r.v.  X  defined  on  n  and 
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every  probability  P  which  is  absolutely  continuous  with  respect  to  Q,  (27)  holds.  Thus,  by 
generating  a  sample  point  u  in  0  under  Q,  we  simultaneously  obtain  an  unbiased  estimate  of 
every  expectation  a  =  a(X,P)  of  the  form  a(X,  P)  =  EPX.  For  example,  when  one  generates 
a  path  of  a  Markov  chain  having  transition  matrix  Pi,  one  is  actually  obtaining  unbiased 
sampling  information  about  every  Markov  chain  having  transition  matrix  for  which 
Pt[i,j)  =  0  whenever  Pi(i,j)  =  0. 

We  will  now  show  that  when  Q  is  chosen  to  satisfy  the  stronger  condition  (26)  (as 
occurs  in  Sections  2-4),  importance  sampling  can  be  combined  with  the  method  of  control 
variates  to  further  reduce  the  variance.  The  main  observation  is  that  by  setting  X  =  l  in 
(27),  we  find  that  EqL  =  1.  Hence,  Z[ A)  =  XL-X{L- 1)  has  expectation  a  under  Q.  The  basic 
theory  of  control  variates  then  shows  that  if  Eq(X*  + 1  )L3  <  oo,  then  va rQ  Z{\)  is  minimized 
by  choosing  A  =  A* ,  where 

A*  =  covq(XL,  L)/wziqL 


(assuming  var<j  L  >  o),  in  which  case  va tq  Z( A*)  =  vai-Q  XL  -  [covq(.XX,  L)\2 /  var<j  L.  Of  course, 
in  a  practical  implementation,  A*  must  generally  be  estimated  from  sample  data. 

One  more  point  must  be  emphasized  here.  The  use  of  importance  sampling  can  fre¬ 
quently  lead  to  estimators  with  infinite  variance  (see  Proposition  6  for  example);  finite 
variance  must  not  be  taken  for  granted  in  this  setting.  This  occurs  despite  the  fact  that 
importance  sampling  gives  rise  to  estimators  with  finite  first  moment  whenever  the  original 
estimation  problem  has  finite  expectation.  Hence,  if  the  sampling  distribution  Q  is  poorly 
chosen,  importance  sampling  can  lead  to  an  estimation  procedure  with  exceptionally  poor 
convergence  properties. 

As  discussed  above,  sampling  under  Q  provides  information  about  all  probabilities  P 
which  satisfy  (26).  As  we  shall  now  see,  Q  can  often  also  be  used  to  generate  sample 
outcomes  in  n  having  distribution  P.  Suppose  the  likelihood  ratio  Lq  is  bounded  by  a 
(deterministic)  constant  AT.  If  W  is  a  random  point  in  n  generated  under  Q  and  if  U  is  an 
independent  uniform  (0, 1)  r.v.,  then 


QOV  €  du\MU  <  L{W) j  = 


EqEq{I{W  edu>,MU  <  L(W)\W } 
EqEq{I(MU  <  L(W)\W) 


=  EqI(W  £  du)L{w)  =  P(du). 


Thus,  conditional  on  the  event  {MU  <  L(W)},  W  has  distribution  P.  This  is  the  method  of 
acceptance-rejection  extended  from  n  =  R  to  arbitrary  fl  (see  p.  50  of  Rubinstein  (1981)). 
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We  can  illustrate  the  technique  with  an  application  to  the  M/M/l/oo  queue  with  time- 
varying  arrival  and  service  time  rates.  Specifically,  we  suppose  that  the  arrival  and  service 
time  rates  at  time  t  are  given,  respectively,  by  A(t)  and  n(t)  where  0  <  A (t),n(t)  <  K.  Our 
goal  is  to  generate  sample  paths  of  an  M/M/l/oo  queue  with  these  rates  up  to  time  S„, 
where  S„  is  the  entry  instant  of  the  (n  +  l)’st  state  visited. 

If  Q  is  the  probability  corresponding  to  a  queue  in  which  the  rates  are  set  equal  to  A 
and  n  for  all  t  >  0,  then  Bremaud  (1981),  p.  168,  shows  that 

lq=  n  (nr)  n  «p  f-  fs\x(s)-x+ (M(5)  -lt)HQ(a)>0))ds 

,€A.  V  A  '  y€A«  v  ^  /  L  ■/o 

where  <J(s)  is  the  queue-length  at  time  a,  A„  =  {;  <  n  :  S}-  is  an  arrival  epoch},  and  A'  = 
A*.  If  X,n  are  lower  bounds  on  the  functions  A(  ),n( )  over  [o,t],  it  follows  that  LQ  < 
K2n/(X/i)n;  the  above  acceptance-rejection  method  then  applies  to  this  problem.  Hence,  the 
time-varying  queue  previously  described  can  be  generated  from  trajectories  of  a  “standard” 
M/M/l/oo  queue.  We  should  caution,  however,  that  this  would  be  an  inefficient  way  of 
simulating  a  time-varying  queue,  since  a  (very)  large  proportion  of  the  “standard”  paths 
might  be  rejected.  Competing  methods  for  generation  of  sample  paths  for  M/M/l/oo  queues 
with  time-varying  arrival  and  service  rates  appear  in  Lewis  and  Shedler  (1979  a,  b). 


6.  IMPORTANCE  SAMPLING:  SECTIONS  2-4  REVISITED 

We  now  wish  to  apply  some  of  the  ideas  described  in  Section  5  to  the  processes  dis¬ 
cussed  in  Sections  2-4.  We  start  by  trying  to  determine  the  probability  Q  which  minimizes 
va tqLq  (see  (29)),  when  P  is  a  Markov  probability  associated  with  a  discrete-time  chain. 
Specifically,  suppose  our  task  is  to  estimate  a  =  EPf{X0,...,Xn ),  where  EP(  )  is  the  expec¬ 
tation  on  the  path  space  of  X  associated  with  initial  distribution  n  and  transition  matrix 
P.  It  is  easily  verified  that  the  probability  Q  is  then  given  by 


Q(Af0  =  *o,  ••  • ,  Xn  =  »„} 


l/(*o, 


n—  1 

•  •  >  *n)  |m(*'o)  I"J  P(*'ji  */+l) 

_ £0 _ 

EP\f(X0,..,,Xn\ 


(30) 


Clearly,  Q  is  not,  in  general,  itself  a  Markov  probability,  in  the  sense  that  there  will  usually 
exist  no  initial  distribution  v  and  transition  matrices  K}-  for  which  Q{X0  =  *0, . . . ,  Xn  -  in  }  = 
^ojriyJo  *>(*>•*>+ 1)-  Thus,  the  minimal  variance  probability  Q  has  the  additional  disad¬ 
vantage,  beyond  those  discussed  in  Section  5,  that  Q  will  generally  be  non-Markov.  The 
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non-Markov  structure  of  Q  suggests  that  generation  of  sample  outcomes  under  Q  will 
frequently  be  (very)  computationally  intensive.  A  similar  conclusion  holds  for  continuous¬ 
time  chains  and  GSMP’s. 

There  is  one  important  setting,  however,  where  the  optimal  measure  Q  is  in  fact 
Markov.  Consider  the  problem  of  estimating,  for  a  Markov  chain  X  and  subset  A,  the 
parameter 

a  =  Pp(T(X)>m|Jf0  =  *} 


for  t  JkA,  where  T(A)  =  inf{n  >  0  :  Xn  e  A}.  Notice  that  a  =  EpI(T(A)  >  m)  where  I(T(A)  >  m) 
takes  the  multiplicative  form 


I(T[A)  >  m)  =  I]  *  A) 

fc=0 

Then,  (30)  takes  the  form 

m  —  1 

<2{X0  =  to,  •  •  •  i  Xn  =  tn}  ~  6i'i0  (J  Kj (ty , tj+ 1 ) 

y=o 


where 


KAi.k)  =  P(i,k) 


Pp{T(A)  >  m  —  j  —  l[Xp  =  k} 
Pp{T(A)>m-j\X0  =  k} 


for  t,  k&  A. 

For  a  birth-death  process  X  in  which  A  =  {6  +  l,b  +  2,...},  it  is  evident  that  PP{T(A)  > 
m-l|Ao  ==  »'  —  1}  >  PP{T(A)  >  m\X0  =  i}  so  Ay(t,t'-1)  >  P(t,t-l)  for  all »'  and  j.  In  other  words 
the  optimal  measure  Q  increases  the  likelihood  of  drifting  to  the  left.  It  is  interesting  to 
also  note  that  Kj(i,k)  =  P{i,  k)  if  |6  - 1  |  >  m  -  j. 

Further  information  can  be  obtained  by  analyzing  the  asymptotic  behavior  of  PP{T(A)  > 
m\Xo  =  k}.  Note  that 


PP{Xm  =  t,T{A)  >  m\X0  =  k}  =  GTt 


where  G  =  (Gkt  :  k,UAc)  is  the  matrix  in  which  Gkt  =  P(k,t)-  Suppose  X  is  irreducible,  Ac 
has  finite  cardinality  r,  and  G  is  diagonalizable.  Then, 


G  =  £~ldiag(Ai,.. . , \r)B 

is  the  spectral  decomposition  of  G,  where  |Aj|  <  l  (this  follows  from  the  irreducibility).  If 
|Ai|  >  |A2|  >  •  •  •  >  |Ar|,  then 

A -"»Gm  -G*  =  £-1diag(l,0, 
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(note  that  At  =  |Aij  by  the  Perron-Frobenius  theorem  for  non-negative  matrices).  Thus,  it 
follows  that 

GTi  ~  Git  ■  *?■ 

Under  these  conditions,  one  finds  that  Kj (»,  A:)  -«  P(»,fc)  A^1  gIi/12ua‘  Git •  Hence,  the 

optimal  measure  Q  is  asymptotically  (for  luge  m)  time-homogeneous  Markov,  in  the  sense 
that  Kj(t,k)  -*  K(i,k)  as  m  -►  oo  (for  j  fixed). 

The  above  arguments  suggest  that  it  is  frequently  reasonable  to  reduce  the  vari¬ 
ance  by  choosing  a  sampling  distribution  which  minimizes  vsiqLq  over  the  class  of  time- 
homogeneous  Markov  sampling  distributions.  Suppose  that  we  wish  to  estimate 


a  =  EPf(X o . Xn ) 

where  X  =  {X„  :  n  >  0}  is  a  discrete-time  chain  with  transition  matrix  P  and  initial  distri¬ 
bution  n.  If  Q  is  the  Markov  probability  corresponding  to  transition  matrix  K  and  initial 
distribution  u,  then 

Z*  —  f2lX  X  nxj,xj+l)2  a 

qLQ  -  Eqf  (X0, ... ,  Xn)  11  K(x^  Xj+i)2 

=  J2  M*o,.  -,*n)v(*o)-1  n  -a2 

•o.»'i . •'»  i=° 

where  h(t0,...,tn)  =  /2(to,...,tn)/i2(to)IIy=o  P2(*i,*i+i).  To  find  the  minimizer  K’,v'  for 
wcqLq,  we  observe  that  the  n2  +  n  variables  {K(i,j),  u{i)  :  1  <  i,j  <  n)  satisfy  the  con¬ 
straints  K(i,j)  >  0,  i/(t)  >  0,  £"=1  X{i,j)  =  1  (1  <  *  <  n),  £“=1  *'(j)  =  1-  Applying  the  method 
of  Lagrange  multipliers,  we  find  that  the  optimizer  K*  ,v'  satisfies 

V"  _ M«0 . *n) _  l/*(*o)~1 


. . . 

(».j)€  path 


II  A*(u,*k+i)  . *',+1 


=  Ai,  1  <i,j<n 


i/*(t)  =  j/(t)/^v(y),  1  <t<n 
;'=i 

where  the  A,  ’s  are  the  Lagrange  multipliers  and 

n(to,  •  ■  ■  ,*n)  =total  number  of  times  (i,;) 

appears  in  path  (i0 . »n) 

1/3 


*(0  = 


r»—  1 


<l . Jr*(»'.M)II **(•'>•<>+») 

y=i 
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By  multiplying  through  the  (t,;)’th  equation  by  the  corresponding  denominator,  we  see 
that  computing  the  Jf*(t,y)’s  involves  solving  a  large  number  of  coupled  equations  which 
are  multinomial  in  the  optimal  probabilities  As  a  consequence,  we  conclude  that  it 

is,  in  general,  unrealistic  to  expect  that  one  can  solve  explicitly  for  the  optimal  Markovian 
sampling  distribution  Q *. 

A  second  important  concept  discussed  in  Section  5  is  that  the  likelihood  ratio  L  can 
be  used  as  a  control  variate.  In  fact,  for  the  processes  of  Section  2-4,  control  variates 
arise  naturally  when  importance  sampling  is  applied.  To  illustrate  this  idea,  we  focus  on 


GSMP’s.  We  first  observe  that 


j  li  en) 

H  1  n+i;i4Ble;) 


*(<?»+ 1(«);  ^n+l»  )Mo,  Co,  •  •  •  >  ^n)  On  }  —  1? 


1/'  . ’  «€N(a.+  1.a..«;)  ) 

this  in  turn  implies  that  for  fc  >  0, 

En{Ln+k(F,H)\A0,C0,..  ,An,Cn)  =  Ln(F,H).  (31) 

(31)  states  that  {Ln(F,H)  n  >  0}  is  a  martingale  sequence  when  generated  under  Ph{  )- 
These  simple  observations  guarantee  that  EHDn(i)  =  0,  where  D„(i)  =  (L„+i(.F,  H)/Ln(F,  H))  - 
1.  Hence,  any  of  the  r.v.’s  {£>„(»)  :  n,»  >  0}  may  be  used  as  controls. 

Even  more  controls  can  be  constructed  when  the  estimation  problem  takes  the  form 


“  * Er  {/„ 


h(X(3))ds 


for  some  function  h,  with  r  a  stopping  time.  In  this  case,  Section  4  (see  (23),  for  example) 
shows  that  EHM  -  0,  where 

M  =  £(MAt)Ct(en/«-(A<,eJ))(Lr(F,/f)  -  Lt(F,H)). 

1=0 

In  fact,  even  more  is  true.  Each  of  the  summands  appearing  in  M,  namely 


Mt  =  (h(At)Ct(tl)/r(Ae,  e‘t))(Lr(F,  H)  -  Le[F,  H))I( r  >  l) 


have  vanishing  expectation  with  respect  to  EH ;  this  follows  immediately  from  the  martin¬ 
gale  property  (31).  Since  the  single  state  simulation  strategy  of  Section  4  involves  a  ratio 
of  expectations  of  the  form  (32)  (for  the  denominator,  set  h  =  l),  we  see  that  the  controls 
arise  naturally  when  steady-state  simulation  for  GSMP’s  is  carried  out. 

It  is  to  be  expected  that  the  control  variates  described  here  will  yield  at  least  a  moderate 
increase  in  the  efficiency  of  importance  sampling  methods  for  GSMP’s. 
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7.  IMPORTANCE  SAMPLING:  THE  PARAMETRIC  THEORY 

With  the  exception  of  our  brief  discussion  of  the  probability  distribution  Q  in  Section 
5,  we  have  concentrated  primarily  on  describing  the  basic  implementation  of  importance 
sampling  in  a  stochastic  simulation  context,  without  saying  much  about  how  to  actually 
choose  the  new  sampling  distribution  Q.  In  this  section,  we  briefly  address  this  difficult 
problem.  We  shall  assume  that  the  goal  is  to  estimate  a  =  EpX,  when  the  probability 
distribution  P  can  be  embedded  in  a  parameterized  family  of  distributions  P»  (i.e.  P  =  Peo 
for  some  90  in  the  parameter  set).  Our  goal  is  to  select  6 *  to  maximize  the  efficiency 
of  importance  sampling.  By  the  central  limit  theorem  of  Section  5,  this  is  equivalent  to 
requiring  that  we  choose  9'  to  minimize 

tr2(0)  =  Ee(3  ■  var gXL(9), 

where  £#(•)  is  the  expectation  corresponding  to  Pe  and  L(9)  is  the  likelihood  ratio  of  P„0 
with  respect  to  Pe. 

To  give  some  idea  of  the  large  variance  reductions  possible,  consider  the  problem  of 
estimating 

a  =  EPf(X . . Xr{A)) 

where  r(A)  =  inf{n  >  0 :  XncA}  with  A  -  {i  +  1,...}  and  X  is  a  Markov  chain  with  transition 
matrix  P.  Suppose  P  takes  the  form 

PM  =  Si, 

P(i,j)  =  pSi.i- 1  +  qSi,i+ 1  for  »  >  l. 

Such  a  P  can  be  embedded  in  the  parameterized  family  P(9),  where 

P(0,O,j)  =  6u 

P(S,i,])  =  96i_i-i  +  (1  —  for  *  >  1. 

Note  that  EeP  is  independent  of  9.  Since  the  chain  at  time  r(A)  has  taken  6  -  i  more  steps 
to  the  right  than  steps  to  the  left,  it  follows  that 

Thus,  if  9  =  q,  L(q)  =  (q/p)b~Xo  and  hence 

var,{/(A0,...,  Ar(X))|A0  =  *} 

=  £,{/’(*. . . Ar(4))L3(g)|Ao  =  »}-a2 

=  (q/p)b-iEp{f2(XQ,...,Xr{A))\X0  =  *}  -  a2. 
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So,  if  p  »  q  or  b  > «,  the  potential  computational  savings  obtained  by  using  importance 
sampling  with  8  ■  q  are  enormous. 

This  example  indicates  that  the  additional  effort  required  in  finding  an  efficient  8  at 
which  to  simulate  can  have  a  significant  payback.  One  general  approach  for  accomplishing 
this  is  to  use  stochastic  approximation  (e.g.  the  Kiefer- Wolfowitz  method)  to  minimize 
<t3( 8 ).  Since  <r2[8)  may  be  infinite  over  a  substantial  subset  of  the  parameter  space  (see 
Section  8),  this  can  make  a  stochastic  approximation  algorithm  highly  unstable.  A  second 
general  approach  to  selecting  an  efficient  8  involves  using  the  mathematical  theory  of 
probability  to  select  a  8  which  will  be  close  to  8 *;  the  relevant  mathematics  needed  is  the 
theory  of  “large  deviations”.  (See  Billingsley  (1979)  for  a  brief  description.)  Consider,  for 
example,  the  problem  of  estimating 

a  =  a[w)  =  P{W  >  u>}  (33) 

where  W  is  the  steady-state  waiting  time  of  a  customer  in  a  GI/G/ l/oo  queue  having  arrival 
and  service  time  distributions  FA  and  Fs,  respectively.  Clearly,  if  w  is  large,  this  tail 
probability  will  be  expensive  to  calculate  to  a  reasonable  degree  of  relative  accuracy. 

We  now  embed  P  in  a  parametric  family.  Assume  that  FA,Fs  have  moment  generating 

functions  pA,ps  that  converge  in  a  neighborhood  of  zero.  Let 

FA{8,dz)  =  e9‘FA(dz)/pA(8) 

Fs(8,dx)  =  '••Fx(dx)/ps(ey, 

then  Fa(8),Fs(8)  are  the  distribution  functions  of  non-negative  r.v.’s,  and  can  be  viewed  as 
the  inter-arrival  and  service  lime  distribution  for  the  P»  system. 

To  select  the  best  possible  8 ,  let  p(8)  =  <ps(8)<(>a(-8)-  Note  that  <p  is  the  moment 
generating  function  of  the  difference  between  a  service  time  r.v.  and  an  (independent) 
inter-arrival  time  r.v.  Because  we  need  the  mean  inter-arrival  time  to  be  greater  than  the 
mean  service  time  for  the  queue  to  be  stable  and  (33)  to  make  sense,  it  follows  that  p  has 
a  negative  derivative  at  8  =  0.  Because  of  the  convexity  of  p,  it  is  then  evident  that  any 
root  8  ft  0  of  p(8)  =  l  must  be  positive  and  unique.  Assume  such  a  root  8X  exists. 

To  apply  these  ideas  to  (33),  we  need  to  use  the  well-known  fact  that 

“  =  '>{osm*¥‘.T,>“’}  |3<) 

where  T0  =  0\Tn  =  ££_,(Sk  -  ^4*),  and  the  Ak,St' s  are  independent  copies  of  the  arrival  and 
service  time  r.v.’s.  Note  that  (34)  is  equal  to 


a  =  P{r(w)  <  oo} 
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(35) 


where  r(u>)  =  inf{n  >0:Tn>w}. 

We  now  apply  importance  sampling  to  (35)  by  using  inter-arrival  and  service  time 
distributions  JU(*i)i/s(fi)  respectively,  rather  than  FA,FS.  Then 

P{r(ti>)  <  00}  =  Po{r(tc)  <  00}  =  EeiI(r(w)  <  oo)Lr(„)(tfi) 
where  IT(«)(^i)  is  the  likelihood  ratio  given  by 

It  is  easily  verified  that  Eet  (5i  -  Ai)  >  0  and  hence  r(w)  <  00  occurs  with  probability  one 
under  P6x.  We  conclude  that 

a(w)  =  E9lLT(w)(6 1).  (36) 

Siegmimd  (1976)  argues  that  importance  sampling,  based  on  P9x  as  in  (36),  is  asymptoti¬ 
cally  more  efficient  (as  w  -*  00)  than  any  other  choice  of  6  when  estimation  is  based  on  the 
representation  (35).  Asmussen  (1985)  shows  that  the  choice  continues  to  be  the  optimal 
choice  in  heavy  traffic  conditions  where  E0A  «  EoS.  The  efficiency  increases  that  axe  typical 
when  this  “large  deviations”  idea  is  applied  can  be  enormous;  Asmussen  (1985)  reports 
efficiency  increases  on  the  order  of  a  factor  of  3  to  a  factor  of  400.  Work  on  extension  of 
these  ideas  to  queueing  networks  is  active;  see,  for  example,  Cottrell  et  al.  (1983),  Parekh 
and  Walrand  (1986),  Weiss  (1986),  and  Anantharun  (1987).  This  area  of  work  holds  great 
promise  for  development  of  importance  sampling  methods  permitting  substantial  efficiency 
increases. 

8.  RESPONSE  SURFACE  ESTIMATION  USING  IMPORTANCE  SAM¬ 
PLING 

In  many  applications,  a  parametric  embedding  of  the  type  described  in  Section  7  arises 
naturally.  For  example,  in  the  study  of  queueing  systems,  one  is  frequently  interested  in 
understanding  the  behavior  of  the  process  as  a  function  of  the  probability  law  Pe  imposed 
on  the  queue,  where  P»  corresponds  to  the  queue  dynamics  under  service  rate  6.  Since  the 
family  {P#}  so  obtained  often  satisfies  (26)  (i.e.  Pe0(A)  >  0  implies  P«l(A)  >  0  for  0O  /  0i), 
importance  sampling  seems  natural  in  this  context.  In  particular,  one  can  then  estimate 
the  expectation  of  X  under  Pe  by  simulating  X  under  P»0  for  some  fixed  60.  This  suggests 
that  one  cam  use  importance  sampling,  under  distribution  P»o,  to  simultaneously  estimate 
the  entire  response  function 

a{6)  =  EP,X. 
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In  a  queueing  context,  this  idea  raises  the  possibility  that  one  can  estimate  queue 
performance  over  an  entire  interval  of  service  time  parameters  0  by  simulating  the  queue 
at  a  particular  0O  and  applying  importance  sampling  appropriately. 

To  describe  the  idea  precisely,  let  X  be  a  real-valued  r.v.  defined  on  a  sample  space  Cl 
(with  <r-field  7).  Suppose  {Pe  :  0  e  [a,  6]}  is  a  family  of  probabilities  on  (Q,  7)  which  satisfy: 

>  0  implies  Q(A)  >  0 

(37) 

for  every  A  e  7  and  0  e  [a,  6]. 

In  many  applications,  (37)  is  satisfied  with  Q  =  P»  o,  for  some  80  e  [a,  6).  Then,  it  follows 
from  Section  5  that  for  every  d  e  [a,  6],  there  exists  L(0)  such  that 

Ep,X  =  EqXL(8). 

Hence,  if  Ep,\X\  <  oo  for  all  8  e  [a, i],  it  follows  that 


a(8)  =  EqXL(8).  (38) 

To  estimate  the  “response  function”  <*(•),  (38)  suggests  that  one  generate  independent 
replicates  {XiLi(8) :  8  e  [a,  6]}  of  the  random  process  {XL(8) :  8  e  |a,6]}.  The  function  a( )  can 
then  be  estimated  via 

M-)  =  (39) 

fc=i 

To  illustrate  this  ide,a,  we  now  specialize  to  the  discrete-time  Markov  chain  setting. 
For  each  8  e  [a,  b],  let  P{8),  n(8)  be  an  associated  transition  matrix  and  initial  distribution, 
respectively;  the  parameter  8  might  correspond,  for  example,  to  a  branching  probability,  in 
which  case  P(8)  is  the  transition  matrix  associated  with  branching  probability  8.  Suppose 
that  the  goal  is  to  estimate  the  expectation  of  f[X0,X lt...,Xn)  under  transition  matrix  P(8) 
and  initial  distribution  *i(0),  as  a  function  of  0.  If  A(0)  =  {(*,y,Jb)  :  P(8,i,j)  >  0,  n{8,k)  >  0}  is 
independent  of  8,  then  we  can  estimate  <*(•)  via 

«»(■)  =  z  E  o-  -  >  Xkn)Lnk(P(  ),  P(80))  (40) 


where 


i  (Pte)  Pie ))  p(6’xko)  Vr  P(0,xk]JXiti]+1) 


and  {(Jf*oi-yki..-.,Xfc„)  :  k  >  0}  is  a  sequence  of  i.i.d.  copies  of  (X0, ^.....X™)  generated 
under  Pe„.  The  attractive  feature  of  this  estimator  is  that  the  global  behavior  of  «(■) 


may  be  estimated  by  simulating  the  Markov  chain  locally  (at  So),  so  that  one  avoids  the 
enormous  computation  involved  in  simulating  the  system  at  a  grid  of  pionts  Si.Sj ,...,9m. 

While  it  is  true  that  <s„(S)  -»  a(tf)  a.s.  for  each  0  e  [a, 6)  if  Ep,\X\  <  oo,  one  is  generally 
more  interested,  in  the  current  function  estimation  context,  in  studying 

en(a,b)=  sup  |fi„(0)  -  a(tf)|. 

#€  [a,  6) 

The  quantity  en(a,b)  describes  the  global  behavior  of  <s„(  )  as  am  estimator  of  the  function 
a(-)  over  the  interval  (a,  b). 

Theorem  0.  Suppose  that  L(0)  is  continuous  in  9  a.s.  on  [a,  6],  so  that  M  =  M(a,b)  = 
max{L(0)  :  9  €  [a, 6]}  <  oo.  If  £|.Xj  M  <  oo,  then  en(a,b)  — ►  0  a.S.  as  n  — ♦  oo. 

This  result,  which  states  that  a„(  )  approximates  a(  )  uniformly  well  over  the  interval 
[a,  4],  is  an  immediate  consequence  of  Theorem  8.1,  p.  256,  of  Parthasarathy  (1967). 

Returning  to  the  Markov  chain  setting  described  above,  assume  that  P(8),n(e)  are  con¬ 
tinuous  in  9.  Since  /  is  automatically  bounded  (S  is  finite) ,  it  follows  that  E\f(X0, Xi,..., Xn) \ 
M  <  oo  for  every  n  >  0  and  all  intervals  (a, 4].  Hence,  <sn(  )  — *  <*(■)  uniformly  on  bounded 
intervals  a.s. 

The  situation  is  more  complicated,  however,  when  steady-state  estimation  is  involved. 
As  discussed  earlier,  we  suggest  using  regenerative  simulation  in  this  case  (see  Proposition 
2). 

As  discussed  in  Proposition  2,  it  is  evident  that 

, ,  =  EP{f,o){YLT(P(  ),P(9o))\Xo=i}  = 

U  £p(9o,{rLT(P(  ),P(e0))(Xo=t}  "  l(  ) 

To  estimate  <*(•)  uniformly  well  over  an  interval  [a,  4],  it  suffices  to  uniformly  estimate  “(MO 
over  the  interval.  To  apply  Theorem  6  to  u( ),  observe  that 

|K|  •max{Lr(P(*),.P(0o)) :  9e[a,b)}  <  ||fc||  •  rK(a,  b)T  (41) 

where  ||A||  =  max{|h(t)|  :ieS }  ( h  is  the  steady-state  functional  under  consideration),  K(a,b)  = 
ma x{P{6,i,j)/P(90,i,j)  P{80,  i,j),  9e\a,  4]}.  Note  that  the  expectation  of  the  right-hand  side  of 
(41)  is  just  Aj(A(a,4))||/i||,  where  K>(*)  =  EP[9<)){rzT |X0  =  j}. 

Proposition  6.  If  o  <  52^  P(9o,],k)  for  all  ;  e  5  (S  finite),  then  there  exists  *0  such  that 
=  oo  for  a  >  *o- 
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Thus,  if  K(a, b)  >  *o,  the  bound  (41)  becomes  infinite.  This  suggests  that  if  K(a,b)  is 
large,  *„(  )  may  not  approximate  <*(•)  uniformly  well  over  [a,  6].  Of  course,  K(a,b)  will  tend 
to  be  large,  for  most  parameters ations  P(0),  when  b-a  is  big.  Hence,  steady-state  response 
function  estimation  appears  to  behave  poorly  when  the  interval  [a,  b]  under  consideration 
is  too  large. 

This  conclusion  is  also  suggested  by  the  following  result.  For  the  proof,  see  the  next 
section. 

Proposition  7.  If  inf  :jeS}>  1,  and  >  0  for  all  j,  then  \aiP(go]LT(P(6), 

P(M)  =  oo. 


9.  COMPUTATIONAL  AND  EMPIRICAL  RESULTS 

Whenever  new  simulation  methodology  is  proposed,  it  is  always  helpful  to  be  able  to 
find  the  theoretical  values  for  quantities  being  estimated.  In  this  section  we  shall  discuss 
the  relevant  theoretical  values  in  the  context  of  the  steady-state  estimation  problem  for 
discrete-time  Markov  chains. 

Suppose  that  X  =  {*„  :  n  >  0}  is  an  irreducible  finite-state  Markov  chain  with  transition 
probability  matrix  P  =  {P(*,y)}  and  state  space  E  =  {0,  l, . . . ,  IV}.  Assume  we  wish  to  estimate 

l  n~l 

a  =  Ep  lim  -  T  h(X") 

"  fc— o 

for  some  given  h  :  E  M.  As  indicated  in  Section  2,  this  problem  can  be  attacked  using 
the  regenerative  method.  We  shall  use  the  state  0  as  a  “return”  state  in  the  regenerative 
method  and  set 

r  =  inf  {n  >  1 :  Xn  =  0}  and 

y-Ew 

*=0 

Standard  regenerative  theory  yields  a  =  Ep{Y\X0  =  0}/Ep{t\Xo  =  0}.  If  we  now  switch  to 
a  new  transition  matrix  K  (satisfying  the  conditions  below  equation  (7)),  we  see  from 
Proposition  2,  that  we  also  have 


=  EK(YLT  1*0  =  0} 

:  EK{rLT\Xo=0}’ 


where  Lr  =  lr(P,K)  as  in  Proposition  1.  From  Proposition  1  we  know  that 

£r{y|Xo  =  o}  =  £K{rLr|Xo  =  o}  and 


EP{r  \X0  =  0}  =  Ek{tLt\X0  =  0}. 
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Let  P  denote  the  matrix  obtained  from  P  by  replacing  P(0,  j)  by  O  for  all  j  €  E.  Then  from 
standard  theory  (Hordijk,  Iglehart,  Schassberger  (1976)),  we  know  that 

EP{Y\X0  =  0}  =  [(/-  of>)"1A]0  and 
£p{r|Xo  =  0}  =  l(J- 

where  <s  is  the  vector  of  all  ones,  h  is  a  vector  with  components  h(i)  (i  €  E),  and  [  ]0  denotes 

the  0  component  of  the  vector. 

To  compare  the  efficiency  of  estimating  a  under  K  versus  P ,  we  use  Theorem  2.  Assume 
that  the  computational  effort  to  generate  a  cycle  is  proportional  to  the  length  of  the  cycle, 
so  p{i)  =  r.  The  variance  of  the  limiting  normal  when  a  is  estimated  under  P  is  given  by 

EP{(Y-ar)*  |*o  =  0} 

EP{r \X0  =0}  {  1 

whereas  if  a  is  estimated  under  K ,  using  the  ratio  representation  given  by  Proposition  2, 

then  the  appropriate  variance  is 

EK{(Y  -  ar)2L2(P,  K)\Xq  =  0}  Ek{t\X0  =  0} 

( Ek{tLt(P,K)\Xo  =  0 })»  1  1 

To  calculate  the  four  terms  appearing  in  (42)  and  (43),  only  the  term  Ek{(Y  -  or)2 

L*(P,K)\X0  =  0}  presents  something  new.  (Recall  that  Ek{tLt(P,  K)\X0  =  0}  =  Ep{t\X0  =  o}.) 

The  other  three  terms  can  be  handled  using  the  methods  in  Hordijk,  Iglehart,  Schassberger 

(1976).  Calculating  this  new  term  is  a  formidable  job  and  in  fact  it  may  be  equal  to  +oo. 

We  begin  by  computing  the  quantity 


aj  =  EK{L2T(P,K)\X0  =  j). 


Note  that 


a,  =  ^2ek{L2(P,K)I(t  =  l)\X0  =  ]} 


y*  P3(»o,»i)  P2(u  i,»t) 

i0_<e  A(*o,*i) 

•o=l.  *1=0 

*,*o,  i<j<e 

=  E(c'-^)i 

e=i 

where  C  =  (Ckt),b  =  (6,)  and  Ckl  =  P‘2(k,l)/K{k,t)  for  t  /  0,CfcO  =  o,bt  =  P2(t,0)/ K{1,0).  Hence, 
we  can  compute  the  a/s  by  solving  for 

a  =  f;C'6. 
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Note  that  a  satisfies  a  =  Ca  +  6.  If  a  is  finite,  then  (7  -  C)a  =  6,  so  that  if  /  -  C)~l  exists,  a 
must  equal  (7-C)-16.  Thus,  even  when  ( I-C)~l  exists,  we  are  still  faced  with  the  problem 
of  distinguishing  when  a  is  finite,  so  that  a  =  (/  -  C)~1b. 


Proposition  8.  Assume  C  >  0,a  =  C*b,  and  (7  -  C)~l  exists.  If  (/  -  C)_l  >  0,  then  a  is 
finite  and  equals  (7  -  C)_16.  Furthermore,  (7  -  C)_l  >  0  if  and  only  if  the  spectral  radius  of 
C  is  less  than  one. 

A  sufficient  condition  for  the  spectral  radius  of  C  to  be  less  than  one  is  that 

N 


max 

i‘E 


E  C3k 


<  1 


or,  in  our  setting 


&  K^k) 


Similarly,  it  is  easily  verified  that  a  sufficient  condition  for  1C*  o  Cr  =  oo  is  that 


or 


N 

E  c>k  >  1 

3  k=0 


'•e  £5  *<>,*) 


This  observation  basically  proves  Proposition  7. 

To  compute  the  variance  term  (43),  we  need  to  compute  the  term 


-  Ek  |  (jt  ?(^))  Lr(p’  *)!*>  = 


(44) 


where  g(-)  =  h(  )  -  a.  Expanding  the  square  in  (44)  and  conditioning  on  the  value  of  X, 
leads  to 


d,=g7U)EK{L2T(P,K)\Xo  =  ]} 
+ 

P2U,k) 


{ Zff(Xr)lUP,K)\X0  =  k} 

+  E 


k*  0 


K(j,k) 


dk. 


So,  d  satsifies  4  =  6'  +  Cd,  where  CM  -  P*[k,l)/K(k,l )  for  l  ?  0 ,CK0  =  0,  and 

=g7U)EK{L2(P,K)\Xo  =  j) 

+  Mi)  E  1$$Ek  {£  s(Xr)L2(P,  K)\x0  =  *  j  • 
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This  system  of  linear  equations  has  the  same  structure  as  that  derived  for  the  EK{L2(P,  K)\ 
X0  =  >}’s  discussed  above.  Finally,  an  analogous  system  of  equations  can  be  obtained  for 
the  unknown  term 

EKyEg{X.)L*[P,K)[Xo-k 

lr-0 

appearing  in  i'.  The  equations  for  the  d,* s  can  then  be  solved  to  calculate  the  variance 
term  (43). 

At  this  point  we  have  only  developed  one  numerical  illustration  of  importance  sam¬ 
pling  for  discrete-time  Markov  chains.1  The  original  P  matrix  (corresponding  to  an  (*,S) 
inventory  model)  is  given  by 


f  I/3 

0 

0 

2/3- 

1/3 

1/3 

0 

1/3 

1/3 

1/3 

1/3 

0 

0 

1/3 

1/3 

1/3 

with  state-space  E  =  {1,2, 3, 4}.  We  were  interested  in  simulating  to  estimate  n4  =  0.347838. 
The  K  matrix  used  in  the  importance  sampling  is  given  by 


0.476 

0 

0 

0.524' 

0.311 

0.476 

0 

0.213 

0.206 

0.319 

0.475 

0 

.  0 

0.233 

0.233 

0.534. 

Based  on  a  simulation  of  3000  4-cycles,  we  found  an  estimate  of  0.319  (0.540)  for  the  term 
Ef{(Y  -  ar)2 |X0  =  4}  ( EK{(YLr  -  <xtLt)2\X0  m  4).  We  can  compute  Ek{t\X0  =  4}  (EP{r\X0  =  4) 
to  be  2.8749  (3.0625).  This  leads  to  a  value  for  the  ratio,  R 2,  of  (43)  to  (42)  of  0.745.  Using 
the  alternate  method  proposed  in  Proposition  3  leads  to  a  value  for  the  corresponding  R2 
ratio  of  1.278.  As  indicated  in  Section  2,  the  asymptotic  efficiency  of  importance  sampling 
does  depend  on  the  return  state  used.  For  this  example,  the  R 2  values  for  the  first  method 
using  states  1,  2,  and  3  respectively  were  1.615,1.568,  and  2.153.  For  the  second  method 
the  corresponding  values  were  1.822, 1.729,  and  2.069.  The  fact  that  these  values  were  larger 
than  l  should  not  be  surprising,  since  the  K  matrix  was  specifically  chosen  to  do  well  for 
the  return  state  4. 

In  conclusion,  we  need  to  obtain  much  more  computational  experience  with  importance 
sampling.  In  particular  better  methods  for  selecting  the  Q  matrix  are  needed. 


1  We  are  grateful  to  Scott  Schulz  for  providing  this  numerical  example. 
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APPENDIX 


Proof  of  Proposition  1:  Note  that 


EpJ(X . . XT )  =  ,...,Xk)I(T=k)  (XI) 

k-0 

Since  T  is  a  stopping  time,  f(X0,...,Xk)I(T  =  k)  =  fk{X0 . Xk)  for  some  function  /*.  By 

(8), 

EPfK(Xo,...,Xk)  =  Ekfk(X0t...,Xk)Lk(P,K)  (X2) 

Combining  (Al)  and  (A2),  we  obtain 

oo 

EpJ(X o,  •  •  • .  XT)  =  52  Ek  f(Xo . Xn)I(T  =  n)Ln(P,  K) 

nsxO 

=  EKf(X . . Xt)Lt(P,K). 

Proof  of  Proposition  8.  Let  »?>(*)  =  =  j}.  By  conditioning  on  Xi ,  we  find 

that 

*>>(*)  =  *P{0o,3,i)  +  *'52P(0o,j,k)v>k(z) 

i.e.  <p(z)  =  a(*)  +  C(*)v>(a).  Since  C(s),  y>(*),a(*)  >  o  for  2  >  0,  it  follows  that  <p(z)  > 
(*)<*(*)•  Also,  for  s  >  20  =  l/min,  P[$0,j,k), 

min  Ec^w  > 1 

3  k 

which  implies  that  0(2)*  -  00  as  k  -  00.  Since  0(2)  >  0,  it  follows  that  *?(*)  =  00  for  2  >  z0. 

Proof  of  Proposition  8.  Since  C  >  0,  it  is  evident  that  (/  -  C)  =  I  -  C"+l  <  I. 

Since  (/-C)"1  >  0,£;=oC>  <  (7-C)"1.  By  letting  n  -  00,  it  follows  that  <  {I-C)~l. 

Hence,  a  <  00  and  (as  argued  in  Section  9)  a  =  (/  -  C)_16. 

This  argument  also  proves  that  if  (/  -  C)~l  >  0,  then  C3  —  0  and  j  —  00;  the  fact  that 
C3  —■  0  is  equivalent  to  the  spectral  radius  of  C  being  less  than  one.  For  the  converse,  note 
that  if  C3  -♦0,  it  must  necessarily  tend  to  zero  exponentially  fast,  so  that  0  <  £°L0C>  <  00. 
But  it  is  easily  verified  that  0C»'(/  -  C)  =  7  so  that  (/  -  C)-1  =  £”0  C3  < 
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